10/1/2020

Agenda

  • Train-test split
  • Cross-validation
  • Bootstrap

Train-test split

Training and Test errors

Recall the distinction between the test error and the training error:

  • The test error is the average error that results from using a statistical learning method to predict the response on a new observation, one that was not used in training the method
  • In contrast, the training error can be easily calculated by applying the statistical learning method to the observations used in its training
  • But the training error rate often is quite different from the test error rate, and in particular the former can dramatically underestimate the latter
  • This problem is due to over-fitting: if we evaluate accuracy on the same data used to train the model we are cheating and the error estimate will be too optimistic

How to assess model accuracy?

\[Y = \beta_0 + \beta_1 x\]

How to assess model accuracy?

\[Y = \beta_0 + \beta_1 x + \beta_2 x^2\]

How to assess model accuracy?

\[Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\]

How to assess model accuracy?

\[Y = ???\]

How to assess model accuracy?

What if I was hiding some test data?

Bias-variance tradeoff

How to evaluate Test error

Ideal solution: have a large designated test set to evaluate performance. This is often not available!

  • Some methods make a mathematical adjustment to the training error rate in order to estimate the test error rate. These include the \(C_p\) statistic, AIC and BIC (we will see these later in the course)
  • Here we instead consider a class of methods that estimate the test error by holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held out observations

Validation set approach

So far we have evaluated accuracy with in-sample MSE (regression) and in-sample accuracy (classification).

Goal: estimate the prediction error of a fitted model.

A potential solution:

  1. Randomly split the data into two parts: a training set and a test (validation, hold-out) set
  2. Fit the model(s) on the training set
  3. Make predictions on the test set using the fitted model(s)
  4. Check how well you did (MSE, classification error, …)

A perfectly sensible approach, but it still has some drawbacks!
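
A minimal sketch of steps 1-4 in Python, using NumPy only; the data here are simulated with a simple linear signal, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data standing in for a real dataset
n = 200
x = rng.uniform(-3, 3, size=n)
y = 1 + 2 * x + rng.normal(scale=1.0, size=n)

# 1. Randomly split into a training set and a test (validation, hold-out) set
idx = rng.permutation(n)
train, test = idx[: n // 2], idx[n // 2 :]

# 2. Fit the model (here simple linear regression) on the training set
X_tr = np.column_stack([np.ones(train.size), x[train]])
beta, *_ = np.linalg.lstsq(X_tr, y[train], rcond=None)

# 3. Make predictions on the test set using the fitted model
X_te = np.column_stack([np.ones(test.size), x[test]])
y_hat = X_te @ beta

# 4. Check how well you did (here: test MSE)
print("validation MSE:", np.mean((y[test] - y_hat) ** 2))
```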

Validation set approach

A random split into two halves: the left part is the training set, the right part is the validation set.

Validation set approach

Drawback 1: the estimate of the error rate can be highly variable.

  • It depends strongly on which observations end up in the training and testing sets
  • Can be mitigated by averaging over multiple train/test splits, but this can get computationally expensive

Drawback 2: only some of the data points are used to fit the model.

  • Statistical methods tend to perform worse with less data
  • So the test-set error rate is biased: it tends to overestimate the error rate we would see if we trained on the full data set
  • This effect is especially strong for complex models

Validation set approach example

  • We want to compare linear vs higher-order polynomial terms in a regression
  • We randomly split the 392 observations into two sets, a training set containing 196 of the data points, and a validation set containing the remaining 196 observations
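
A sketch of this comparison in Python, assuming scikit-learn is available. A simulated dataset with a mildly nonlinear signal stands in for the Auto data; the 196/196 split and the candidate polynomial degrees mirror the slide, and repeating the split with different seeds shows the variability described in Drawback 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)

# Simulated stand-in for the 392-observation Auto data
x = rng.uniform(1, 10, size=392)
y = 50 - 8 * x + 0.5 * x**2 + rng.normal(scale=3.0, size=392)
X = x.reshape(-1, 1)

for seed in range(3):  # different random splits give different error estimates
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, train_size=196, test_size=196, random_state=seed)
    mses = []
    for degree in (1, 2, 3):
        poly = PolynomialFeatures(degree=degree, include_bias=False)
        model = LinearRegression().fit(poly.fit_transform(X_tr), y_tr)
        y_hat = model.predict(poly.transform(X_va))
        mses.append(mean_squared_error(y_va, y_hat))
    print(f"split {seed}: validation MSE for degrees 1-3 =", np.round(mses, 2))
```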

Cross validation

K fold Cross validation

K-fold cross validation is a widely used approach for estimating the test error. Randomly divide the data into \(K\) equally sized parts (\(K\) non-overlapping groups, or folds).

For fold \(k\) in \(1:K\)

  1. Fit the model using all data points not in fold \(k\)
  2. For all points \((y_i, x_i)\) in fold \(k\), predict \(\hat{y}_i\) using the fitted model
  3. Calculate \(\text{Err}_k\), the average error on fold \(k\)

Estimate the error rate as:

\[CV_{(K)} = \frac{1}{K} \sum_{k=1}^K \text{Err}_k\]
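
A from-scratch sketch of this procedure for least-squares regression, assuming NumPy and simulated data (the function name kfold_cv_mse is just illustrative):

```python
import numpy as np

def kfold_cv_mse(X, y, K=5, seed=0):
    """Estimate the test MSE of least-squares regression by K-fold CV."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), K)  # K non-overlapping groups
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.setdiff1d(np.arange(n), test)
        # 1. Fit the model using all data points not in fold k
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        # 2. Predict for all points in fold k
        y_hat = X[test] @ beta
        # 3. Average error on fold k
        errs.append(np.mean((y[test] - y_hat) ** 2))
    return np.mean(errs)  # CV_(K)

# Example on simulated data; the design matrix includes an intercept column
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=100)
X = np.column_stack([np.ones(100), x])
y = 1 + 3 * x + rng.normal(size=100)
print("5-fold CV estimate of the test MSE:", kfold_cv_mse(X, y, K=5))
```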

K fold Cross validation

  • Choosing a large \(K\) requires more computation than choosing a small \(K\) (the model has to be refit \(K\) times)
  • More stable (lower variance) than running \(K\) random train/test splits
  • Leave-one-out cross validation (LOOCV) is a special case of \(K\)-fold CV with \(K = n\)

Leave one out Cross validation

Key idea: average over all possible testing sets of size \(n_{test} = 1\).

For observation \(i\) in \(1:n\)

  1. Fit the model using all data points except \(i\)
  2. Predict \(y_i\) with \(\hat{y}_i\), using the fitted model
  3. Calculate the error, \(\text{Err}(y_i, \hat{y}_i)\)

Estimate the error rate as:

\[CV_{(n)} = \frac{1}{n} \sum_{i=1}^n \text{Err}(y_i, \hat{y}_i)\]

Leave one out Cross validation

Less bias than the train/test split approach:

  • fitting with \(n - 1\) points is almost the same as fitting with \(n\) points
  • thus, LOOCV is much less prone to overestimating the test error we would see when training on the full data set

No randomness:

  • estimating the error rate using train/test splits will yield different results when applied repeatedly
  • LOOCV will give the same result every time

Downside: must re-fit \(n\) times!

Leave one out Cross validation

With least-squares linear or polynomial regression, an amazing shortcut makes the cost of LOOCV the same as that of a single model fit! The following formula holds

\[CV_{(n)} = \dfrac{1}{n} \sum_{i=1}^n \left( \dfrac{y_i - \hat{y}_i}{1 - h_i} \right)^2\]

where \(\hat{y}_i\) is the \(i^{th}\) fitted value from the original least squares fit, and \(h_i\) is the leverage of observation \(i\) (the \(i^{th}\) diagonal element of the hat matrix).
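
A quick numerical check of this identity, assuming NumPy and simulated data: fit once, read the leverages off the diagonal of the hat matrix, and compare with the explicit leave-one-out loop.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
x = rng.uniform(-2, 2, size=n)
X = np.column_stack([np.ones(n), x, x**2])        # quadratic least-squares fit
y = 1 + 2 * x - x**2 + rng.normal(size=n)

# Single fit: fitted values and leverages h_i from the hat matrix
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_fit = H @ y
h = np.diag(H)
cv_shortcut = np.mean(((y - y_fit) / (1 - h)) ** 2)

# Explicit LOOCV for comparison: n separate fits
errs = []
for i in range(n):
    keep = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    errs.append((y[i] - X[i] @ beta) ** 2)
cv_loop = np.mean(errs)

print(cv_shortcut, cv_loop)   # the two estimates agree up to rounding error
```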

The bias-variance tradeoff (again)

Key insight: there is a bias-variance tradeoff in estimating test error.

Variance comes from overlap in the training sets:

  • In LOOCV, we average the outputs of \(n\) fitted models, each trained on nearly identical observations
  • In K-fold CV, we average the outputs of \(K\) fitted models that are less correlated with one another, since they have less overlap in the training data; as a result, the K-fold estimate generally has lower variance than LOOCV

Typical values: \(K = 5\) to \(K = 10\) (no theory; a purely empirical observation).

Cross validation for Classification problems

We divide the data into \(K\) roughly equal-sized parts \(C_1, \dots, C_K\), where \(C_k\) denotes the indices of the observations in fold \(k\). There are \(n_k\) observations in fold \(k\): if \(n\) is a multiple of \(K\), then \(n_k = n/K\).

Compute \[CV_{(K)} = \frac{1}{K} \sum_{k=1}^K \text{Err}_k\] where \(\text{Err}_k = \sum_{i \in C_k} \mathcal{I}(y_i \neq \hat{y}_i)/n_k\).
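
A sketch of this computation for a binary classifier, assuming scikit-learn's logistic regression and simulated two-class data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, K = 200, 5

# Simulated two-class data with some signal in both predictors
X = rng.normal(size=(n, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

folds = np.array_split(rng.permutation(n), K)   # C_1, ..., C_K
err_k = []
for k in range(K):
    test = folds[k]
    train = np.setdiff1d(np.arange(n), test)
    clf = LogisticRegression().fit(X[train], y[train])
    # Err_k: misclassification rate on fold k
    err_k.append(np.mean(clf.predict(X[test]) != y[test]))

print("CV estimate of the error rate:", np.mean(err_k))
```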

Cross-validation: right and wrong

Consider a simple binary classifier:

  • Starting with \(5000\) predictors and \(50\) samples, find the \(100\) predictors having the largest correlation with the class labels
  • We then apply a classifier such as logistic regression, using only these \(100\) predictors

How do we estimate the test set performance of this classifier? Can we apply cross-validation in Step 2, forgetting about Step 1?

Cross-validation: right and wrong

This would ignore the fact that in Step 1, the procedure has already seen the labels of the training data, and made use of them. This is a form of training and must be included in the validation process.

  • It is easy to simulate realistic data with the class labels independent of the predictors, so that the true test error should be \(50\%\), but the CV error estimate that ignores Step 1 is zero! Try this yourself (see the sketch below)
  • This error has been made in many high-profile genomics papers
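
A sketch of that simulation, assuming NumPy and scikit-learn (the helper names are illustrative). The labels are generated independently of the predictors, so the true error rate is \(50\%\); the "wrong" procedure screens the \(100\) predictors on all of the data before cross-validating, while the "right" procedure redoes the screening inside every fold:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p, m = 50, 5000, 100                       # samples, predictors, predictors kept

X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)                # labels independent of X: true error 50%

def top_correlated(X, y, m):
    """Indices of the m predictors with the largest |correlation| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(np.abs(corr))[-m:]

def cv_error(X, y, K, select_inside):
    folds = np.array_split(rng.permutation(len(y)), K)
    if not select_inside:                     # WRONG: screen using all of the data
        keep = top_correlated(X, y, m)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.setdiff1d(np.arange(len(y)), test)
        if select_inside:                     # RIGHT: screen using the training folds only
            keep = top_correlated(X[train], y[train], m)
        clf = LogisticRegression(max_iter=1000).fit(X[train][:, keep], y[train])
        errs.append(np.mean(clf.predict(X[test][:, keep]) != y[test]))
    return np.mean(errs)

print("wrong CV error:", cv_error(X, y, K=5, select_inside=False))  # far below 0.5
print("right CV error:", cv_error(X, y, K=5, select_inside=True))   # close to 0.5
```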

Cross-validation: wrong

Cross-validation: right

Example: k-NN for regression

Example: \(k = 1\)

Example: \(k = 50\)
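
A sketch of how cross-validation might be used to choose \(k\) for k-NN regression, assuming scikit-learn and simulated data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Simulated nonlinear regression data
x = rng.uniform(0, 10, size=300)
y = np.sin(x) + rng.normal(scale=0.3, size=300)
X = x.reshape(-1, 1)

# 10-fold CV estimate of the test MSE for a range of k values:
# k = 1 tends to overfit, while very large k oversmooths
for k in (1, 5, 10, 50):
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y,
                             cv=10, scoring="neg_mean_squared_error")
    print(f"k = {k:3d}: CV MSE = {-scores.mean():.3f}")
```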

More general framework

1. Set aside a test set (e.g. a \(70\%-15\%-15\%\) train-validation-test split)

More general framework

2. Use CV on train-validation sets

Train on the training folds, validate on the held-out CV fold to get \(\text{Err}_{CV,1}\).

More general framework

2. Use CV on train-validation sets

Train on the training folds, validate on the held-out CV fold to get \(\text{Err}_{CV,2}\).

More general framework

3. Choose a model based on the CV error

Average the CV errors \(\text{Err}_{CV,1}, \dots, \text{Err}_{CV,K}\):

\[\widehat{\text{Err}}_{CV} = \frac{1}{K} \sum_{k=1}^{K} \text{Err}_{CV,k}\]

We do this for several models, and choose the model with the smallest \(\widehat{\text{Err}}_{CV}\).

More general framework

4. Evaluate final model on test set

Calculate the error on the test set, \(\widehat{\text{Err}}_{test}\). This is an unbiased estimate of the test error!
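
An end-to-end sketch of steps 1-4, assuming scikit-learn, simulated data, and two illustrative candidate models:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Simulated data
x = rng.uniform(0, 10, size=500)
y = np.sin(x) + rng.normal(scale=0.3, size=500)
X = x.reshape(-1, 1)

# 1. Set aside a test set (here 15% of the data)
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.15, random_state=1)

# 2-3. Use CV on the remaining data to compare candidate models
candidates = {
    "linear regression": LinearRegression(),
    "5-NN regression": KNeighborsRegressor(n_neighbors=5),
}
cv_err = {name: -cross_val_score(model, X_dev, y_dev, cv=5,
                                 scoring="neg_mean_squared_error").mean()
          for name, model in candidates.items()}
best = min(cv_err, key=cv_err.get)
print("CV errors:", cv_err, "-> chosen model:", best)

# 4. Refit the chosen model and evaluate it once on the held-out test set
final = candidates[best].fit(X_dev, y_dev)
print("test MSE:", mean_squared_error(y_test, final.predict(X_test)))
```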

The bootstrap

The goal

  • The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method
  • For example, it can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient
  • It is not the same as the term “bootstrap” used in computer science meaning to “boot” a computer from a set of core instructions, though the derivation is similar

Uncertainty quantification

Say that we have some data \(\mathbf{X} = (x_1, \dots, x_n)\) and we want to estimate the location parameter \(\mu\) via the classic estimator \(\overline{\mathbf{X}}\).

How can we quantify our uncertainty for this estimate?

Question: “how might my estimate \(\overline{\mathbf{X}}\) have been different if I had seen a different sample of observations \(x_i\) from the same population \(P(X)\)?”

Algorithm

The bootstrap consists of the following algorithm:

For \(b\) in \(1:B\)

  1. Construct a sample from \(\widehat{P}(X)\) (called a bootstrapped sample) by sampling \(n\) observations \(x_i\) with replacement from the original sample
  2. Recompute the estimator on this bootstrapped sample, giving you \(\overline{\mathbf{X}}^{(b)}\)

This gives us \(B\) draws from the bootstrapped sampling distribution of \(\overline{\mathbf{X}}\). Use these draws to form (approximate) confidence intervals and standard errors for \(\mu\).

Key: approximate \(P(X)\) with \(\widehat{P}(X)\)!
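
A sketch of this algorithm for the sample mean (the simple example that follows), assuming NumPy and simulated data; it also reports the percentile confidence interval mentioned at the end of this section:

```python
import numpy as np

rng = np.random.default_rng(0)

# Original sample (here simulated from a normal distribution)
x = rng.normal(loc=5.0, scale=2.0, size=100)
B = 2000

# For b in 1:B: resample n observations with replacement and recompute the mean
boot_means = np.array([rng.choice(x, size=x.size, replace=True).mean()
                       for _ in range(B)])

print("estimate:          ", x.mean())
print("bootstrap SE:      ", boot_means.std(ddof=1))
print("95% percentile CI: ", np.percentile(boot_means, [2.5, 97.5]))
# For comparison, the familiar formula for the SE of the mean is s / sqrt(n)
print("formula SE:        ", x.std(ddof=1) / np.sqrt(x.size))
```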

A simple example

We have a sample from a normal distribution, and want to quantify uncertainty around the mean parameter.

A simple example

  1. Sample \(n\) observations \((x_1^{(b)}, \dots, x_n^{(b)})\) with replacement from the original sample
  2. Compute the mean \(\overline{X}^{(b)}\)

We then have a sample of estimates \((\overline{X}^{(1)}, \dots, \overline{X}^{(B)})\).

  • For the sample mean we already knew the answer to this problem (the standard error has a simple known formula)
  • The bootstrap can be used for estimators whose sampling distribution is unknown (any estimator, any data distribution)

More advanced bootstrap

  • In more complex data situations, figuring out the appropriate way to generate bootstrap samples can require some thought
  • For example, if the data is a time series, we cannot simply sample the observations with replacement (why?)
  • We can instead create blocks of consecutive observations and sample those blocks with replacement. We then paste the sampled blocks together to obtain a bootstrap dataset
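
A sketch of one such scheme, the moving block bootstrap, on a simulated autocorrelated series, assuming NumPy (the block length of 20 is an arbitrary illustrative choice):

```python
import numpy as np

def block_bootstrap_sample(x, block_len, rng):
    """One moving-block bootstrap sample of a time series x."""
    n = len(x)
    n_blocks = int(np.ceil(n / block_len))
    # Sample block starting points with replacement, then paste the blocks together
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    return np.concatenate([x[s:s + block_len] for s in starts])[:n]

rng = np.random.default_rng(0)

# Simulated AR(1) series: consecutive observations are dependent
x = np.zeros(200)
for t in range(1, 200):
    x[t] = 0.7 * x[t - 1] + rng.normal()

# Bootstrap standard error of the mean that respects the serial dependence
boot_means = [block_bootstrap_sample(x, block_len=20, rng=rng).mean()
              for _ in range(2000)]
print("block-bootstrap SE of the mean:", np.std(boot_means, ddof=1))
```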

Other uses of the bootstrap

  • Primarily used to obtain standard errors of an estimate
  • Also provides approximate confidence intervals for a population parameter
  • Given a bootstrap sample of estimates of the parameter of interest, we can construct an approximate \(95\%\) confidence interval for the true \(\mu\) (e.g. from the \(2.5\%\) and \(97.5\%\) quantiles of the bootstrap estimates)
  • This is called a Bootstrap Percentile confidence interval. It is the simplest method (among many approaches) for obtaining a confidence interval from the bootstrap

Question time